Part = 1

Group Data into similar clusters

Now we will use K-Means clustering to group the data based on their attributes. First, we need to determine the optimal number of clusters; for that, we conduct the knee (elbow) test to see where the knee occurs.

Since the bend does not come out clearly (there are several candidate bends), let us look at 2, 3, and 4 clusters.

WSS keeps decreasing as K increases.

Calculating WSS for other values of K - Elbow Method
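The elbow computation can be sketched as follows. `X` is a stand-in for the scaled feature matrix (here synthetic blobs, since the original dataset is not shown); WSS for each K is K-Means' `inertia_`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the scaled dataset used in the notebook
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

wss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wss.append(km.inertia_)  # within-cluster sum of squares for this K
# The "elbow" is the K where the drop in WSS starts to flatten.
```

Plotting `k_values` against `wss` (e.g. with `matplotlib`) gives the elbow curve referred to above.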

KMeans with K=3
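Fitting K-Means with K=3 can be sketched as below; the blob data is an illustrative stand-in for the notebook's dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data; replace with your scaled feature matrix
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km3 = KMeans(n_clusters=3, n_init=10, random_state=42)
labels3 = km3.fit_predict(X)  # cluster label (0, 1, or 2) per row
```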

The silhouette score is better for 2 clusters than for 3 and 4 clusters, so the final number of clusters will be 2.
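The silhouette comparison can be sketched as follows, assuming a stand-in two-cluster dataset; on the notebook's own data the scores would of course differ, but the selection logic is the same.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data with two underlying groups
X, _ = make_blobs(n_samples=300, centers=2, random_state=0)

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher is better

best_k = max(scores, key=scores.get)  # K with the highest silhouette score
```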

Appending Clusters to the original dataset
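Appending the cluster labels is a one-line column assignment; `df` below is a stand-in for the original dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the original dataset
X, _ = make_blobs(n_samples=100, centers=2, random_state=0)
df = pd.DataFrame(X, columns=["f1", "f2"])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df)
df["cluster"] = labels  # append cluster membership as a new column
```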

Applying PCA
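Applying PCA can be sketched as below; the iris data is an illustrative stand-in, and projecting to 2 components makes the clusters easy to visualize.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; PCA should be run on standardized features
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)  # project onto the first two components
```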

Part = 2

Scaling the data
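Scaling (standardizing each column to mean 0 and standard deviation 1) can be sketched as follows; the tiny matrix is an illustrative stand-in.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in data with columns on very different scales
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column: mean 0, std 1
```

Scaling matters for K-Means because the algorithm uses Euclidean distance, which would otherwise be dominated by the largest-scale column.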

KMeans with K=3
KMeans with K=2
The silhouette score is better for 2 clusters than for 3 clusters, so the final number of clusters will be 2.

Appending Clusters to the original dataset

Part = 3

# Q3. Standardize the data

Using hierarchical clustering

Plotting the dendrogram for the consolidated dataframe
From the truncated dendrogram, find the optimal distance between clusters that you want to use as input for clustering the data.
Use this distance measure (max_d) and the fcluster function to cluster the data into 3 different groups.
Use matplotlib to visually observe the clusters in 2D space.
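The steps above can be sketched as follows. The blob data is a stand-in, and since no real dendrogram is available here, `max_d` is derived programmatically as a cut between the last merges; in the notebook you would read it off the plotted dendrogram instead.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Stand-in data with three underlying groups
X, _ = make_blobs(n_samples=60, centers=3, random_state=1)

Z = linkage(X, method="ward")  # linkage matrix; dendrogram(Z) would plot it

# Cut distance chosen between the two highest merges, yielding 3 clusters.
# In practice, read max_d off the truncated dendrogram.
max_d = (Z[-3, 2] + Z[-2, 2]) / 2.0

clusters = fcluster(Z, t=max_d, criterion="distance")  # labels 1..3
```

A scatter plot of the first two feature columns colored by `clusters` then shows the groups in 2D space.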

Applying PCA

Dimensionality Reduction: with 8 variables we can explain over 95% of the variation in the original data, so reducing the dimensionality now seems very reasonable!
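Choosing the number of components from the cumulative explained variance can be sketched as below; the iris data is a stand-in (it needs only 2 components for 95%, whereas the notebook's dataset needs 8), but the selection logic is the same.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; replace with the notebook's standardized features
X = StandardScaler().fit_transform(load_iris().data)

pca = PCA().fit(X)  # keep all components to inspect the variance profile
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 95% of the variance
n_components = int(np.argmax(cumvar >= 0.95)) + 1
```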

Fit SVM Classifier

Let's construct two SVM models: the first with all the independent variables and the second with only the 8 new variables constructed using PCA.
It looks like, by reducing the dimensionality by 9 variables, we only dropped around 4% in R^2. This is in-sample (on training data), and hence a drop in R^2 is expected; it still seems easy to justify dropping the variables. Out of sample (on test data), the model with the 10 independent variables is likely to do better, since it would be less of an over-fit.
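The two-model comparison can be sketched as follows. The breast cancer dataset (30 features) is a stand-in for the notebook's data, and classification accuracy stands in for the R^2-style score quoted above; the structure, a full-feature SVM versus a PCA-reduced SVM, is the same.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset with 30 independent variables
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Model 1: SVM on all independent variables
full = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

# Model 2: SVM on 8 PCA components only
reduced = make_pipeline(StandardScaler(), PCA(n_components=8), SVC()).fit(X_tr, y_tr)

acc_full = full.score(X_te, y_te)
acc_reduced = reduced.score(X_te, y_te)
```

Comparing `acc_full` and `acc_reduced` on held-out data shows how little is lost by the reduction.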

Part = 4

Part = 5

Listing all possible dimensionality reduction techniques that can be implemented using Python

Below are the different dimensionality reduction techniques that can be implemented using Python. We will now look at each of them.
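A sketch of such a list, using techniques available in scikit-learn (illustrative; the original notebook's exact list is not shown here):

```python
# Common dimensionality reduction techniques in scikit-learn
from sklearn.decomposition import PCA, FactorAnalysis, KernelPCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE, Isomap

techniques = {
    "PCA": PCA,                          # linear, variance-maximizing projection
    "Kernel PCA": KernelPCA,             # PCA in a kernel-induced feature space
    "Truncated SVD": TruncatedSVD,       # works on sparse matrices (a.k.a. LSA)
    "Factor Analysis": FactorAnalysis,   # latent-variable linear model
    "LDA": LinearDiscriminantAnalysis,   # supervised, class-separating projection
    "t-SNE": TSNE,                       # nonlinear embedding for visualization
    "Isomap": Isomap,                    # manifold learning via geodesic distances
}
```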
Question = 2
As you can see, we've taken this simple model from ~30% accuracy on the test set to ~65%.